{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "validation.ipynb",
      "version": "0.3.2",
      "views": {},
      "default_view": {},
      "provenance": [],
      "collapsed_sections": [
        "4Xp9NhOCYSuz",
        "pECTKgw5ZvFK",
        "dER2_43pWj1T",
        "I-La4N9ObC1x",
        "yTghc_5HkJDW",
        "copyright-notice"
      ]
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "copyright-notice",
        "colab_type": "text"
      },
      "source": [
        "#### Copyright 2017 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "copyright-notice2",
        "colab_type": "code",
        "cellView": "both",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "outputs": [],
      "execution_count": 0
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zbIgBK-oXHO7",
        "colab_type": "text"
      },
      "source": [
        " # Validaci\u00f3n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WNX0VyBpHpCX",
        "colab_type": "text"
      },
      "source": [
        " **Objetivos de aprendizaje:**\n",
        "  * usar varios atributos, en lugar de uno solo, para mejorar a\u00fan m\u00e1s la eficacia de un modelo\n",
        "  * depurar problemas en los datos de entrada del modelo\n",
        "  * usar un conjunto de datos de prueba para comprobar si un modelo est\u00e1 realizando un sobreajuste en los datos de validaci\u00f3n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "za0m1T8CHpCY",
        "colab_type": "text"
      },
      "source": [
        " Al igual que en los ejercicios anteriores, trabajamos con el conjunto de datos de viviendas en California para intentar predecir `median_house_value` a nivel de manzana en la ciudad a partir de los datos del censo de 1990."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r2zgMfWDWF12",
        "colab_type": "text"
      },
      "source": [
        " ## Preparaci\u00f3n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8jErhkLzWI1B",
        "colab_type": "text"
      },
      "source": [
        " Primero, carguemos y preparemos nuestros datos. Esta vez, trabajaremos con varios atributos, de manera que usaremos un sistema modular en la l\u00f3gica para procesar los atributos:"
      ]
    },
    {
      "metadata": {
        "id": "PwS5Bhm6HpCZ",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "from __future__ import print_function\n",
        "\n",
        "import math\n",
        "\n",
        "from IPython import display\n",
        "from matplotlib import cm\n",
        "from matplotlib import gridspec\n",
        "from matplotlib import pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "from sklearn import metrics\n",
        "import tensorflow as tf\n",
        "from tensorflow.python.data import Dataset\n",
        "\n",
        "tf.logging.set_verbosity(tf.logging.ERROR)\n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = '{:.1f}'.format\n",
        "\n",
        "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
        "\n",
        "# california_housing_dataframe = california_housing_dataframe.reindex(\n",
        "#     np.random.permutation(california_housing_dataframe.index))"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "J2ZyTzX0HpCc",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def preprocess_features(california_housing_dataframe):\n",
        "  \"\"\"Prepares input features from California housing data set.\n",
        "\n",
        "  Args:\n",
        "    california_housing_dataframe: A Pandas DataFrame expected to contain data\n",
        "      from the California housing data set.\n",
        "  Returns:\n",
        "    A DataFrame that contains the features to be used for the model, including\n",
        "    synthetic features.\n",
        "  \"\"\"\n",
        "  selected_features = california_housing_dataframe[\n",
        "    [\"latitude\",\n",
        "     \"longitude\",\n",
        "     \"housing_median_age\",\n",
        "     \"total_rooms\",\n",
        "     \"total_bedrooms\",\n",
        "     \"population\",\n",
        "     \"households\",\n",
        "     \"median_income\"]]\n",
        "  processed_features = selected_features.copy()\n",
        "  # Create a synthetic feature.\n",
        "  processed_features[\"rooms_per_person\"] = (\n",
        "    california_housing_dataframe[\"total_rooms\"] /\n",
        "    california_housing_dataframe[\"population\"])\n",
        "  return processed_features\n",
        "\n",
        "def preprocess_targets(california_housing_dataframe):\n",
        "  \"\"\"Prepares target features (i.e., labels) from California housing data set.\n",
        "\n",
        "  Args:\n",
        "    california_housing_dataframe: A Pandas DataFrame expected to contain data\n",
        "      from the California housing data set.\n",
        "  Returns:\n",
        "    A DataFrame that contains the target feature.\n",
        "  \"\"\"\n",
        "  output_targets = pd.DataFrame()\n",
        "  # Scale the target to be in units of thousands of dollars.\n",
        "  output_targets[\"median_house_value\"] = (\n",
        "    california_housing_dataframe[\"median_house_value\"] / 1000.0)\n",
        "  return output_targets"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sZSIaDiaHpCf",
        "colab_type": "text"
      },
      "source": [
        " Para el **conjunto de entrenamiento**, elegiremos los primeros 12,000 ejemplos del total de 17,000."
      ]
    },
    {
      "metadata": {
        "id": "P9wejvw7HpCf",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "training_examples = preprocess_features(california_housing_dataframe.head(12000))\n",
        "training_examples.describe()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "JlkgPR-SHpCh",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "training_targets = preprocess_targets(california_housing_dataframe.head(12000))\n",
        "training_targets.describe()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5l1aA2xOHpCj",
        "colab_type": "text"
      },
      "source": [
        " Para el **conjunto de validaci\u00f3n**, elegiremos los \u00faltimos 5,000 ejemplos del total de 17,000."
      ]
    },
    {
      "metadata": {
        "id": "fLYXLWAiHpCk",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "validation_examples = preprocess_features(california_housing_dataframe.tail(5000))\n",
        "validation_examples.describe()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "oVPcIT3BHpCm",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))\n",
        "validation_targets.describe()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z3TZV1pgfZ1n",
        "colab_type": "text"
      },
      "source": [
        " ## Tarea\u00a01: Examina los datos\n",
        "Observemos los datos que aparecen m\u00e1s arriba. Tenemos `9` atributos de entrada que podemos usar.\n",
        "\n",
        "Haz una lectura r\u00e1pida de la tabla de valores. \u00bfSe ve todo bien? Observa cu\u00e1ntos problemas puedes detectar. No te preocupes si no tienes formaci\u00f3n en estad\u00edstica; el sentido com\u00fan te llevar\u00e1 lejos.\n",
        "\n",
        "Una vez que hayas tenido la oportunidad de revisar los datos por tu cuenta, comprueba la soluci\u00f3n para obtener otras ideas sobre c\u00f3mo verificar los datos."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Xp9NhOCYSuz",
        "colab_type": "text"
      },
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gqeRmK57YWpy",
        "colab_type": "text"
      },
      "source": [
        " Comprobemos nuestros datos con algunas explicaciones de punto de referencia:\n",
        "\n",
        "* Para algunos valores, como `median_house_value`, podemos ver si estos valores est\u00e1n dentro de rangos razonables (teniendo en cuenta que se trata de datos de 1990, no actuales).\n",
        "\n",
        "* Para otros valores, como `latitude` y `longitude`, podemos hacer una comprobaci\u00f3n r\u00e1pida para ver si se alinean con los valores esperados a partir de una b\u00fasqueda r\u00e1pida en Google.\n",
        "\n",
        "Si observas m\u00e1s detenidamente, es posible que detectes algunas singularidades:\n",
        "\n",
        "* `median_income` est\u00e1 en una escala de aproximadamente 3 a 15. No est\u00e1 claro a qu\u00e9 se refiere esta escala. \u00bfTal vez sea una escala logar\u00edtmica? No est\u00e1 documentada en ninguna parte; todo lo que podemos suponer es que los valores m\u00e1s altos corresponden a ingresos m\u00e1s altos.\n",
        "\n",
        "* El `median_house_value` m\u00e1ximo es 500,001. Esto parece un l\u00edmite artificial de alg\u00fan tipo.\n",
        "\n",
        "* Nuestro atributo `rooms_per_person` generalmente est\u00e1 en una escala razonable, con un valor percentil 75\u00ba de aproximadamente 2. Pero hay algunos valores muy altos, como 18 o 55, los cuales pueden indicar cierto grado de da\u00f1o en los datos.\n",
        "\n",
        "Por el momento, usaremos estos atributos como est\u00e1n. Pero esperamos que estos tipos de ejemplos puedan ayudarte a desarrollar algo de intuici\u00f3n sobre c\u00f3mo comprobar los datos que llegan a ti de una fuente desconocida."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fXliy7FYZZRm",
        "colab_type": "text"
      },
      "source": [
        " ## Tarea\u00a02: Representa latitud/longitud frente a valor mediano de las casas"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aJIWKBdfsDjg",
        "colab_type": "text"
      },
      "source": [
        " Observemos detenidamente dos atributos en particular: **`latitude`** y **`longitude`**. Estas son las coordenadas geogr\u00e1ficas de la manzana en cuesti\u00f3n.\n",
        "\n",
        "Podr\u00eda parecer una buena visualizaci\u00f3n; representemos `latitude` y `longitude`, y usemos colores para mostrar el `median_house_value`."
      ]
    },
    {
      "metadata": {
        "id": "5_LD23bJ06TW",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "test": {
            "output": "ignore",
            "timeout": 600
          }
        },
        "cellView": "both"
      },
      "source": [
        "plt.figure(figsize=(13, 8))\n",
        "\n",
        "ax = plt.subplot(1, 2, 1)\n",
        "ax.set_title(\"Validation Data\")\n",
        "\n",
        "ax.set_autoscaley_on(False)\n",
        "ax.set_ylim([32, 43])\n",
        "ax.set_autoscalex_on(False)\n",
        "ax.set_xlim([-126, -112])\n",
        "plt.scatter(validation_examples[\"longitude\"],\n",
        "            validation_examples[\"latitude\"],\n",
        "            cmap=\"coolwarm\",\n",
        "            c=validation_targets[\"median_house_value\"] / validation_targets[\"median_house_value\"].max())\n",
        "\n",
        "ax = plt.subplot(1,2,2)\n",
        "ax.set_title(\"Training Data\")\n",
        "\n",
        "ax.set_autoscaley_on(False)\n",
        "ax.set_ylim([32, 43])\n",
        "ax.set_autoscalex_on(False)\n",
        "ax.set_xlim([-126, -112])\n",
        "plt.scatter(training_examples[\"longitude\"],\n",
        "            training_examples[\"latitude\"],\n",
        "            cmap=\"coolwarm\",\n",
        "            c=training_targets[\"median_house_value\"] / training_targets[\"median_house_value\"].max())\n",
        "_ = plt.plot()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "32_DbjnfXJlC",
        "colab_type": "text"
      },
      "source": [
        " Espera un momento\u2026 Esto nos deber\u00eda haber dado un hermoso mapa del estado de California, con el color rojo para representar las \u00e1reas costosas, como San Francisco y Los \u00c1ngeles.\n",
        "\n",
        "El conjunto de entrenamiento s\u00ed lo representa, en comparaci\u00f3n con un [mapa real](https://www.google.com/maps/place/California/@37.1870174,-123.7642688,6z/data=!3m1!4b1!4m2!3m1!1s0x808fb9fe5f285e3d:0x8b5109a227086f55), pero claramente el conjunto de validaci\u00f3n no.\n",
        "\n",
        "**Regresa y vuelve a observar los datos de la Tarea\u00a01.**\n",
        "\n",
        "\u00bfPuedes ver otras diferencias en las distribuciones de los atributos u objetivos entre los datos de validaci\u00f3n y los de entrenamiento?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pECTKgw5ZvFK",
        "colab_type": "text"
      },
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "49NC4_KIZxk_",
        "colab_type": "text"
      },
      "source": [
        " Al observar las tablas de las estad\u00edsticas de resumen anteriores, f\u00e1cilmente te preguntar\u00e1s c\u00f3mo se puede hacer una comprobaci\u00f3n \u00fatil de los datos. \u00bfCu\u00e1l es el valor del percentil 75\u00ba para total_rooms por manzana?\n",
        "\n",
        "El punto clave que se debe observar es que, para cualquier atributo o columna espec\u00edfica, la distribuci\u00f3n de los valores entre las divisiones de entrenamiento y validaci\u00f3n debe ser casi igual.\n",
        "\n",
        "El hecho de que esto no sea as\u00ed es una verdadera preocupaci\u00f3n y demuestra que es probable que tengamos una falla en la forma en que se cre\u00f3 nuestra divisi\u00f3n de entrenamiento y validaci\u00f3n."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "025Ky0Dq9ig0",
        "colab_type": "text"
      },
      "source": [
        " ## Tarea\u00a03: Regresa al c\u00f3digo de procesamiento previo e importaci\u00f3n de datos para ver si se detectan errores\n",
        "Si detectas errores, soluci\u00f3nalos. No pases m\u00e1s de un minuto o dos observando. Si no pueden encontrar el error, comprueba la soluci\u00f3n."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JFsd2eWHAMdy",
        "colab_type": "text"
      },
      "source": [
        " Cuando hayas encontrado y solucionado el problema, vuelve a ejecutar la celda de representaci\u00f3n de `latitude`/`longitude` anterior y confirma si las comprobaciones de estado dan mejores resultados.\n",
        "\n",
        "Por cierto, esto representa una lecci\u00f3n importante.\n",
        "\n",
        "**Con frecuencia, la depuraci\u00f3n en AA es *depuraci\u00f3n de datos* en lugar de depuraci\u00f3n de c\u00f3digo.**\n",
        "\n",
        "Si los datos no son correctos, incluso el c\u00f3digo de AA m\u00e1s avanzado ser\u00e1 incapaz de resolver los problemas."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dER2_43pWj1T",
        "colab_type": "text"
      },
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BnEVbYJvW2wu",
        "colab_type": "text"
      },
      "source": [
        " Observa c\u00f3mo se aleatorizan los datos cuando se leen.\n",
        "\n",
        "Si no aleatorizamos los datos correctamente antes de crear las divisiones de entrenamiento y validaci\u00f3n, es posible que tengamos problemas si los datos se proporcionan en un cierto orden; ese parece ser el caso aqu\u00ed."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xCdqLpQyAos2",
        "colab_type": "text"
      },
      "source": [
        " ## Tarea\u00a04: Entrena y evalua un modelo\n",
        "**Dedica alrededor de 5\u00a0minutos a probar diferentes configuraciones de hiperpar\u00e1metros. Intenta obtener el mejor rendimiento de validaci\u00f3n posible.**\n",
        "A continuaci\u00f3n, entrenaremos un regresor lineal con todos los atributos del conjunto de datos y veremos qu\u00e9 tan bien se desempe\u00f1a.\n",
        "Definamos la misma funci\u00f3n de entrada que usamos anteriormente para cargar los datos en un modelo de TensorFlow.\n"
      ]
    },
    {
      "metadata": {
        "id": "rzcIPGxxgG0t",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):\n",
        "    \"\"\"Trains a linear regression model of multiple features.\n",
        "  \n",
        "    Args:\n",
        "      features: pandas DataFrame of features\n",
        "      targets: pandas DataFrame of targets\n",
        "      batch_size: Size of batches to be passed to the model\n",
        "      shuffle: True or False. Whether to shuffle the data.\n",
        "      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely\n",
        "    Returns:\n",
        "      Tuple of (features, labels) for next data batch\n",
        "    \"\"\"\n",
        "    \n",
        "    # Convert pandas data into a dict of np arrays.\n",
        "    features = {key:np.array(value) for key,value in dict(features).items()}                                           \n",
        " \n",
        "    # Construct a dataset, and configure batching/repeating.\n",
        "    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit\n",
        "    ds = ds.batch(batch_size).repeat(num_epochs)\n",
        "    \n",
        "    # Shuffle the data, if specified.\n",
        "    if shuffle:\n",
        "      ds = ds.shuffle(10000)\n",
        "    \n",
        "    # Return the next batch of data.\n",
        "    features, labels = ds.make_one_shot_iterator().get_next()\n",
        "    return features, labels"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CvrKoBmNgRCO",
        "colab_type": "text"
      },
      "source": [
        " Como ahora estamos trabajando con varios atributos de entrada, usemos un sistema modular en nuestro c\u00f3digo para configurar columnas de atributos en un atributo independiente. (Por ahora, este c\u00f3digo es bastante simple, porque todos nuestros atributos son num\u00e9ricos. Sin embargo, desarrollaremos mejor este c\u00f3digo a medida que usemos otros tipos de atributos en ejercicios futuros)."
      ]
    },
    {
      "metadata": {
        "id": "wEW5_XYtgZ-H",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def construct_feature_columns(input_features):\n",
        "  \"\"\"Construct the TensorFlow Feature Columns.\n",
        "\n",
        "  Args:\n",
        "    input_features: The names of the numerical input features to use.\n",
        "  Returns:\n",
        "    A set of feature columns\n",
        "  \"\"\" \n",
        "  return set([tf.feature_column.numeric_column(my_feature)\n",
        "              for my_feature in input_features])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "D0o2wnnzf8BD",
        "colab_type": "text"
      },
      "source": [
        " A continuaci\u00f3n, avancemos y completemos el c\u00f3digo de `train_model()` m\u00e1s abajo para configurar las funciones de entrada y calcular las predicciones.\n",
        "\n",
        "**NOTA:** Es correcto hacer referencia al c\u00f3digo de los ejercicios anteriores, pero debes asegurarte de invocar `predict()` en los conjuntos de datos adecuados.\n",
        "\n",
        "Compara las p\u00e9rdidas en los datos de entrenamiento y los datos de validaci\u00f3n. Con un solo atributo sin procesar, nuestro mejor error de la ra\u00edz cuadrada de la media (RMSE) fue de alrededor de 180.\n",
        "\n",
        "Observa cu\u00e1nto mejora el desempe\u00f1o ahora que podemos usar atributos m\u00faltiples.\n",
        "\n",
        "Comprueba los datos con alguno de los m\u00e9todos que observamos antes. Entre estos se incluyen los siguientes:\n",
        "\n",
        "   * Comparaci\u00f3n de distribuciones de predicciones y valores objetivo reales\n",
        "\n",
        "   * Creaci\u00f3n de una representaci\u00f3n de dispersi\u00f3n de predicciones frente a valores objetivo\n",
        "\n",
        "   * Creaci\u00f3n de dos representaciones de dispersi\u00f3n de datos de validaci\u00f3n con `latitude` y `longitude`:\n",
        "      * Una representaci\u00f3n asigna el color al `median_house_value` objetivo real.\n",
        "      * Una segunda representaci\u00f3n asigna el color al `median_house_value` predicho para la comparaci\u00f3n en paralelo."
      ]
    },
    {
      "metadata": {
        "id": "UXt0_4ZTEf4V",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "test": {
            "output": "ignore",
            "timeout": 600
          }
        },
        "cellView": "both"
      },
      "source": [
        "def train_model(\n",
        "    learning_rate,\n",
        "    steps,\n",
        "    batch_size,\n",
        "    training_examples,\n",
        "    training_targets,\n",
        "    validation_examples,\n",
        "    validation_targets):\n",
        "  \"\"\"Trains a linear regression model of multiple features.\n",
        "  \n",
        "  In addition to training, this function also prints training progress information,\n",
        "  as well as a plot of the training and validation loss over time.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    batch_size: A non-zero `int`, the batch size.\n",
        "    training_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for training.\n",
        "    training_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for training.\n",
        "    validation_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for validation.\n",
        "    validation_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for validation.\n",
        "      \n",
        "  Returns:\n",
        "    A `LinearRegressor` object trained on the training data.\n",
        "  \"\"\"\n",
        "\n",
        "  periods = 10\n",
        "  steps_per_period = steps / periods\n",
        "  \n",
        "  # Create a linear regressor object.\n",
        "  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)\n",
        "  linear_regressor = tf.estimator.LinearRegressor(\n",
        "      feature_columns=construct_feature_columns(training_examples),\n",
        "      optimizer=my_optimizer\n",
        "  )\n",
        "  \n",
        "  # 1. Create input functions.\n",
        "  training_input_fn = # YOUR CODE HERE\n",
        "  predict_training_input_fn = # YOUR CODE HERE\n",
        "  predict_validation_input_fn = # YOUR CODE HERE\n",
        "  \n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"RMSE (on training data):\")\n",
        "  training_rmse = []\n",
        "  validation_rmse = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_regressor.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period,\n",
        "    )\n",
        "    # 2. Take a break and compute predictions.\n",
        "    training_predictions = # YOUR CODE HERE\n",
        "    validation_predictions = # YOUR CODE HERE\n",
        "    \n",
        "    # Compute training and validation loss.\n",
        "    training_root_mean_squared_error = math.sqrt(\n",
        "        metrics.mean_squared_error(training_predictions, training_targets))\n",
        "    validation_root_mean_squared_error = math.sqrt(\n",
        "        metrics.mean_squared_error(validation_predictions, validation_targets))\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, training_root_mean_squared_error))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    training_rmse.append(training_root_mean_squared_error)\n",
        "    validation_rmse.append(validation_root_mean_squared_error)\n",
        "  print(\"Model training finished.\")\n",
        "\n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.ylabel(\"RMSE\")\n",
        "  plt.xlabel(\"Periods\")\n",
        "  plt.title(\"Root Mean Squared Error vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(training_rmse, label=\"training\")\n",
        "  plt.plot(validation_rmse, label=\"validation\")\n",
        "  plt.legend()\n",
        "\n",
        "  return linear_regressor"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "zFFRmvUGh8wd",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "linear_regressor = train_model(\n",
        "    # TWEAK THESE VALUES TO SEE HOW MUCH YOU CAN IMPROVE THE RMSE\n",
        "    learning_rate=0.00001,\n",
        "    steps=100,\n",
        "    batch_size=1,\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I-La4N9ObC1x",
        "colab_type": "text"
      },
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "metadata": {
        "id": "Xyz6n1YHbGef",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def train_model(\n",
        "    learning_rate,\n",
        "    steps,\n",
        "    batch_size,\n",
        "    training_examples,\n",
        "    training_targets,\n",
        "    validation_examples,\n",
        "    validation_targets):\n",
        "  \"\"\"Trains a linear regression model of multiple features.\n",
        "  \n",
        "  In addition to training, this function also prints training progress information,\n",
        "  as well as a plot of the training and validation loss over time.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    batch_size: A non-zero `int`, the batch size.\n",
        "    training_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for training.\n",
        "    training_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for training.\n",
        "    validation_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for validation.\n",
        "    validation_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for validation.\n",
        "      \n",
        "  Returns:\n",
        "    A `LinearRegressor` object trained on the training data.\n",
        "  \"\"\"\n",
        "\n",
        "  periods = 10\n",
        "  steps_per_period = steps / periods\n",
        "  \n",
        "  # Create a linear regressor object.\n",
        "  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)\n",
        "  linear_regressor = tf.estimator.LinearRegressor(\n",
        "      feature_columns=construct_feature_columns(training_examples),\n",
        "      optimizer=my_optimizer\n",
        "  )\n",
        "  \n",
        "  # Create input functions.\n",
        "  training_input_fn = lambda: my_input_fn(\n",
        "      training_examples, \n",
        "      training_targets[\"median_house_value\"], \n",
        "      batch_size=batch_size)\n",
        "  predict_training_input_fn = lambda: my_input_fn(\n",
        "      training_examples, \n",
        "      training_targets[\"median_house_value\"], \n",
        "      num_epochs=1, \n",
        "      shuffle=False)\n",
        "  predict_validation_input_fn = lambda: my_input_fn(\n",
        "      validation_examples, validation_targets[\"median_house_value\"], \n",
        "      num_epochs=1, \n",
        "      shuffle=False)\n",
        "\n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"RMSE (on training data):\")\n",
        "  training_rmse = []\n",
        "  validation_rmse = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_regressor.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period,\n",
        "    )\n",
        "    # Take a break and compute predictions.\n",
        "    training_predictions = linear_regressor.predict(input_fn=predict_training_input_fn)\n",
        "    training_predictions = np.array([item['predictions'][0] for item in training_predictions])\n",
        "    \n",
        "    validation_predictions = linear_regressor.predict(input_fn=predict_validation_input_fn)\n",
        "    validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])\n",
        "    \n",
        "    \n",
        "    # Compute training and validation loss.\n",
        "    training_root_mean_squared_error = math.sqrt(\n",
        "        metrics.mean_squared_error(training_predictions, training_targets))\n",
        "    validation_root_mean_squared_error = math.sqrt(\n",
        "        metrics.mean_squared_error(validation_predictions, validation_targets))\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, training_root_mean_squared_error))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    training_rmse.append(training_root_mean_squared_error)\n",
        "    validation_rmse.append(validation_root_mean_squared_error)\n",
        "  print(\"Model training finished.\")\n",
        "\n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.ylabel(\"RMSE\")\n",
        "  plt.xlabel(\"Periods\")\n",
        "  plt.title(\"Root Mean Squared Error vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(training_rmse, label=\"training\")\n",
        "  plt.plot(validation_rmse, label=\"validation\")\n",
        "  plt.legend()\n",
        "\n",
        "  return linear_regressor"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "i1imhjFzbWwt",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "linear_regressor = train_model(\n",
        "    learning_rate=0.00003,\n",
        "    steps=500,\n",
        "    batch_size=5,\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "65sin-E5NmHN",
        "colab_type": "text"
      },
      "source": [
        " ## Tarea\u00a05: Evalua los datos de prueba\n",
        "\n",
        "**En la celda a continuaci\u00f3n, carga los datos de prueba y eval\u00faa tu modelo con ellos.**\n",
        "\n",
        "Hemos realizado mucha iteraci\u00f3n en nuestros datos de validaci\u00f3n. Asegur\u00e9monos de no haber sobreajustado las peculiaridades de esa muestra en particular.\n",
        "\n",
        "El conjunto de datos de prueba est\u00e1 ubicado [aqu\u00ed](https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv).\n",
        "\n",
        "\u00bfC\u00f3mo se compara el rendimiento de la prueba con el rendimiento de la validaci\u00f3n? \u00bfQu\u00e9 indica esto sobre el rendimiento de la generalizaci\u00f3n de tu modelo?"
      ]
    },
    {
      "metadata": {
        "id": "icEJIl5Vp51r",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "test": {
            "output": "ignore",
            "timeout": 600
          }
        },
        "cellView": "both"
      },
      "source": [
        "california_housing_test_data = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv\", sep=\",\")\n",
        "#\n",
        "# YOUR CODE HERE\n",
        "#"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yTghc_5HkJDW",
        "colab_type": "text"
      },
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "metadata": {
        "id": "_xSYTarykO8U",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "california_housing_test_data = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv\", sep=\",\")\n",
        "\n",
        "test_examples = preprocess_features(california_housing_test_data)\n",
        "test_targets = preprocess_targets(california_housing_test_data)\n",
        "\n",
        "predict_test_input_fn = lambda: my_input_fn(\n",
        "      test_examples, \n",
        "      test_targets[\"median_house_value\"], \n",
        "      num_epochs=1, \n",
        "      shuffle=False)\n",
        "\n",
        "test_predictions = linear_regressor.predict(input_fn=predict_test_input_fn)\n",
        "test_predictions = np.array([item['predictions'][0] for item in test_predictions])\n",
        "\n",
        "root_mean_squared_error = math.sqrt(\n",
        "    metrics.mean_squared_error(test_predictions, test_targets))\n",
        "\n",
        "print(\"Final RMSE (on test data): %0.2f\" % root_mean_squared_error)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    }
  ]
}
